Vectorize [un]pack4x{I, U}8[Clamp] on spv and msl#7664
Conversation
c8e640f to
ac120e9
Compare
This comment was marked as resolved.
This comment was marked as resolved.
[un]pack4x{I, U}8[Clamp] on spv and msl[un]pack4x{I, U}8[Clamp] on spv and msl
a136017 to
659df18
Compare
This comment was marked as resolved.
This comment was marked as resolved.
|
FWIW, the optimization of |
659df18 to
5ba088c
Compare
|
I squashed some fixes to make this PR easier to review. Each commit should now separately pass CI. |
teoxoy
left a comment
There was a problem hiding this comment.
Looks good overall, left a few small comments.
|
Thank you so much for the review, this is great! I addressed your comments. |
This is blocked on #4582. |
|
Thanks for addressing the comments, let's land it! |
|
Ah it would probably be nice to land this without squashing, could you move the code changes in the last commit to their respective commits? |
Emits vectorized SPIR-V code for the WGSL functions `unpack4xI8`, `unpack4xU8`, `pack4xI8`, `pack4xU8`, `pack4xI8Clamp`, and `pack4xU8Clamp` if `Capability::Int8` is available. Exploits the following facts about SPIR-V ops: - `SClamp`, `UClamp`, and `OpUConvert` accept vector arguments, in which case results are computed per component; and - `OpBitcast` can cast between vectors and scalars, with a well-defined bit order that matches that required by the WGSL spec, see below. WGSL spec for `pack4xI8` [1]: > Component e[i] of the input is mapped to bits 8 x i through 8 x i + 7 > of the result. SPIR-V spec for `OpBitcast` [2]: > Within this mapping, any single component of `S` [remark: the type > with fewer but wider components] (mapping to multiple components of > `L` [remark: the type with more but narrower components]) maps its > lower-ordered bits to the lower-numbered components of `L`. [1] https://www.w3.org/TR/WGSL/#pack4xI8-builtin [2] https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpBitcast
Implements more direct conversions between 32-bit integers and 4x8-bit
integer vectors using bit casting to/from `packed_[u]char4` when on
MSL 2.1+ (older versions of MSL don't seem to support these bit casts).
- `unpack4x{I, U}8(x)` becomes `[u]int4(as_type<packed_[u]char4>(x))`;
- `pack4x{I, U}8(x)` becomes `as_type<uint>(packed_[u]char4(x))`; and
- `pack4x{I, U}8Clamp(x)` becomes
`as_type<uint>(packed_uchar4(metal::clamp(x, 0, 255)))`.
These bit casts match the WGSL spec for these functions because Metal
runs on little-endian machines.
Separates the Vulkan feature sets
`VkPhysicalDeviceShaderFloat16Int8Features` and
`VkPhysicalDevice16BitStorageFeatures`, which previously were used
"together, or not at all".
This commit should not change any behavior yet, but I'd like to run full
CI tests on it for now. If the CI tests pass, I'll use this separation
to enable the `shader_int8` feature separately from the rest of the
features to enable optimizations of `[un]pack4x{I,U}8[Clamp]` on SPIR-V.
This allows declaring the SPIR-V capability "Int8", which allows us to
generate faster code for `[un]pack4x{I, U}8[Clamp]`.
4d633d2 to
871d295
Compare
|
No problem, done. |
|
Thank you! |
Connections
Related to, but independent of:
dot4{I, U}8Packedon SPIR-V and HLSL #7574dot4{I,U}8Packedon Metal #7653Description
Emits simpler code (using vectorized ops) for
unpack4xI8,unpack4xU8,pack4xI8,pack4xU8,pack4xI8Clamp, andpack4xU8Clampon SPIR-V (if capability "Int8" is available) and Metal (with MSL 2.1+).SClamp,UClamp, andOpUConvertaccept vector arguments (in which case results are computed per component), and thatOpBitcastcan cast between vectors and scalars, with a well-defined bit order that matches that required by the WGSL spec (see below for details).unpack4x{I, U}8(x)→[u]int4(as_type<packed_[u]char4>(x)),pack4x{I, U}8(x)→as_type<uint>(packed_[u]char4(x)),pack4xI8Clamp(x)→as_type<uint>(packed_char4(metal::clamp(x, -128, 127))), andpack4xU8Clamp(x)→as_type<uint>(packed_uchar4(metal::clamp(x, 0, 255))).Regarding byte order (this was not relevant for the dot product in #7653 but is relevant now):
Testing
wgpu_gpu::shader::data_builtins::create_unpack4xU8_teston both the optimized and the unoptimized generated code. I'm not sure how to do this but I'd welcome any hints.Squash or Rebase?
Each commit should pass CI.
Checklist
cargo fmt.taplo format.cargo clippy --tests. If applicable, add:--target wasm32-unknown-unknowncargo xtask testto run tests.CHANGELOG.mdentry.